1 Numerical Productivity Measures


1.  Michelle  McElvany  gave  a  talk  at  the  Allied-Signal  Digital  Systems  Technology  Exchange 
Conference  in  June,  1991,  on  the  potential  impact  of  VHDL  on  dependable  distributed  system 
design,  analysis  and  validation. 

2.  Chris  Walter  chaired  a  session  and  served  on  the  program  committee  at  FTCS  in  June,  1991, 
and  will  do  so  for  FTCS  in  June,  1992. 


3.  Chris  Walter  presented  an  invited  paper  at  the  SEI  Fault  Tolerant  Systems  Practitioner’s 
Workshop  in  June,  1991. 
2 Detailed Summary of Technical Progress

The  overall  goal  of  this  project  is  to  develop  theory  for  improved  reliability  modeling  of  systems  with 
mixed  fault  types.  This  also  provides  the  basis  for  formal  methods  to  be  used  in  the  specification, 
design,  construction  and  verification  of  ultra-reliable  multi-processor  systems.  We  assume  that  a 
fault  is  an  anomalous  physical  condition,  the  identified  or  hypothesized  cause  of  an  error,  which 
may  eventually  lead  to  a  failure,  a  loss  of  service. 

In  the  initial  project  phase,  our  goal  is  to  develop  a  hybrid  fault  and  static  reliability  model 
that addresses mixed fault types. An in-depth study of faults, including their sources, their manifestations,
and the techniques needed to reduce malicious fault effects, will then provide accurate
inputs  to  the  hybrid  model.  We  partitioned  our  work  into  three  tasks:  analysis  of  static  reliability 
models,  investigation  of  faults,  and  evaluation  of  the  impact  of  fault  containment  on  fault  effects 
and  on  system  reliability. 

Since existing fault taxonomies did not adequately address the errors caused by faults, we developed
the hybrid fault taxonomy. We applied this taxonomy to existing static reliability models and
showed them to be either unrealistic in their fault assumptions or too restrictive in the algorithms
required to tolerate faults. As a result of this analysis, we developed the hybrid fault model, which
supports more realistic fault assumptions and fault tolerance techniques. We also evaluated the
hybrid fault model as a systems analysis tool, indicating directions for further research. Details of
our progress are reported below.

2.1  Hybrid  Fault  Taxonomy 

Based on our survey of fault taxonomies, described in [1], we have developed the hybrid fault taxonomy,
which describes faults based on characteristics of the errors they cause and on the techniques
needed to tolerate them. The following definitions, including the fault attributes of malice and
symmetry¹, are essential to the hybrid fault taxonomy, formally defined in §2.1.2.

2.1.1  Terminology 

The  scope  of  a  fault  refers  to  the  portion  of  the  system  affected  by  that  fault,  also  called  the  fault 
extent.  A  system  is  fault  tolerant  or  tolerates  a  fault  if  the  required  system  services  are  maintained 
in the presence of a fault. Fault coverage is a measure of the system’s ability to operate correctly
in the presence of a particular fault set. Perfect coverage assumes that system fault tolerance
mechanisms are successful for all possible fault sets.

¹ The definitions of malice and symmetry used in [1] are equivalent to the more formal definitions appearing in
this report.

Active or dynamic redundancy attempts to achieve fault tolerance by fault detection alone, or
in conjunction with location and recovery. Active redundancy techniques include hardware redundancy,
using duplication with comparison, standby sparing, or a pair and a spare; information
redundancy, using data encoding, parity checks, range checks, or sanity checks; and time redundancy,
using checkpointing and rollbacks [2]. In static reliability modeling, all faults are assumed
to be permanent, so no fault location or recovery is considered, and hardware sparing is not used.

Passive  or  static  redundancy  uses  fault  masking  to  hide  the  occurrences  of  faults  and  to  eliminate 
the  effects  of  the  faults,  thus  avoiding  errors.  In  their  simplest  form,  these  techniques  make  no 
attempt  to  detect  the  fault,  much  less  its  source,  and  are  transparent  to  the  user  or  operator. 
Hardware redundancy, using 3, 4, or N hardware modules simultaneously, is another commonly
used passive technique. For further details, see [2].

Passive redundancy techniques can be iterative or non-iterative. Iterative fault masking techniques,
such as interactive convergence [3, 4] and interactive consistency [4, 5], require multiple
rounds or iterations of message exchange among participants. Non-iterative passive redundancy
techniques require a single round of message exchange. Fault-tolerant voting techniques, such as
majority and median, are non-iterative passive redundancy techniques on which iterative passive
redundancy techniques are often based.

Hybrid  redundancy  combines  active  and  passive  redundancy  techniques  to  support  masking  and 
detection  of  faults.  Using  hybrid  redundancy  is  usually  more  expensive  and  complex  than  using 
separate  techniques. 

All  faults  are  either  non-malicious  or  malicious.  A  non-malicious  fault  can  be  detected  using 
an  active  redundancy  technique.  A  malicious  fault  cannot  be  detected  using  active  redundancy 
techniques, but requires masking using a passive redundancy technique. Passive redundancy
techniques are capable of tolerating non-malicious faults as well; the difference
between non-malicious and malicious faults lies in the requirement that passive redundancy
techniques be used for malicious faults. Malicious faults are more severe than non-malicious faults.

Timing faults [6, 7], omission faults [6], and crash faults [8] are all examples of non-malicious
faults.  Faults  which  alter  the  contents  of  a  message  in  a  detectable  way,  such  as  by  violating  a 
range  check  or  a  parity  check,  are  also  non-malicious.  Byzantine  faults  [9]  are  malicious. 

Faults can be either symmetric or asymmetric. A symmetric fault generates errors that are
manifested identically throughout the scope of the fault, i.e., the portion of the system or component
affected by the fault. An asymmetric fault generates errors that are manifested differently
throughout the scope of the fault. Asymmetric faults are more severe than symmetric faults.

2.1.2  Hybrid  Fault  Classes 

Based on the previous definitions, if $\mathcal{F}$ denotes all possible faults, $\mathcal{B}$ all non-malicious faults, $\mathcal{M}$ all
malicious faults, $\mathcal{S}$ all symmetric faults, and $\mathcal{A}$ all asymmetric faults, then

$$\mathcal{F} = \mathcal{B} \cup \mathcal{M} = \mathcal{A} \cup \mathcal{S},$$

with $\mathcal{B} \cap \mathcal{M} = \emptyset$ and $\mathcal{A} \cap \mathcal{S} = \emptyset$.

We combined the attributes of malice and symmetry to produce the four fault sets that make
up the hybrid taxonomy. The set of non-malicious symmetric faults, $\mathcal{B}_S$, is given by $\mathcal{B}_S = \mathcal{B} \cap \mathcal{S}$.
The set of non-malicious asymmetric faults, $\mathcal{B}_A$, is given by $\mathcal{B}_A = \mathcal{B} \cap \mathcal{A}$. The set of malicious
symmetric faults, $\mathcal{M}_S$, is given by $\mathcal{M}_S = \mathcal{M} \cap \mathcal{S}$. The set of malicious asymmetric faults, $\mathcal{M}_A$, is
given by $\mathcal{M}_A = \mathcal{M} \cap \mathcal{A}$.

Definition  1  (Hybrid  Fault  Taxonomy) 

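Every fault in $\mathcal{F}$ is in exactly one of the sets $\mathcal{B}_S$, $\mathcal{B}_A$, $\mathcal{M}_S$, or $\mathcal{M}_A$, with

$$\mathcal{F} = \mathcal{B}_S \cup \mathcal{B}_A \cup \mathcal{M}_S \cup \mathcal{M}_A.$$

The definitions of malice and symmetry guarantee that the individual sets $\mathcal{B}_S$, $\mathcal{B}_A$, $\mathcal{M}_S$,
and $\mathcal{M}_A$ are pairwise disjoint. The worst-case, or most severe, faults in $\mathcal{F}$ are those in $\mathcal{M}_A$. Faults
in $\mathcal{M}_S$ are less severe than the faults in $\mathcal{M}_A$, but are more severe than faults in $\mathcal{B}$.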
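
The four classes can be pictured as the product of two Boolean attributes. The following Python sketch is illustrative only: the names `FaultClass` and `classify` are hypothetical, and the numeric ordering is our encoding of the severity statements in this report (malicious above non-malicious, asymmetric above symmetric within each).

```python
from enum import IntEnum


class FaultClass(IntEnum):
    """The four hybrid fault classes, numbered in increasing order of severity."""
    BS = 0  # non-malicious symmetric
    BA = 1  # non-malicious asymmetric
    MS = 2  # malicious symmetric
    MA = 3  # malicious asymmetric


def classify(malicious: bool, symmetric: bool) -> FaultClass:
    """Map the attributes of malice and symmetry to exactly one class."""
    if malicious:
        return FaultClass.MS if symmetric else FaultClass.MA
    return FaultClass.BS if symmetric else FaultClass.BA


if __name__ == "__main__":
    # Enumerate all attribute combinations; each lands in a distinct class.
    for malicious in (False, True):
        for symmetric in (True, False):
            print(malicious, symmetric, classify(malicious, symmetric).name)
```

Because every fault has a definite value for each attribute, the mapping is total and the four classes cannot overlap, which mirrors the pairwise disjointness noted above.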

2.2  Static  Reliability  Models 

We  used  the  taxonomy  of  §2.1.2  to  classify  common  static  reliability  models,  where  all  faults  are 
assumed  to  be  permanent,  no  fault  isolation  or  repair  is  attempted,  and  fault  coverage  is  perfect. 
The  reliability  of  each  model  is  indicated  along  with  the  number  of  faults  tolerated  by  a  given  system 
under  the  assumption  of  identical  nodes.  For  simplicity,  we  assumed  a  synchronous  message  passing 
system of m components or processes, called nodes, where the only evidence of a faulty node is
an error in a message from that node. A good node is expected to collect information from other
nodes and to arrive at a local decision that is consistent with decisions of all other good nodes.² A
good node may also need to compute a local value within a prespecified range of the values of other
good nodes. If we assume that all the nodes are identical with reliability R, then the reliability of
a system requiring a minimum of n non-faulty nodes to provide system services is given in [10] by

$$R(n \text{ of } m) = \sum_{i=n}^{m} \binom{m}{i} R^{i} (1 - R)^{m-i}. \qquad (1)$$

Except  for  the  unified  model,  all  of  the  models  presented  below  assume  that  every  fault  is  potentially 
the  worst-case  fault. 
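
Equation 1 can be evaluated directly. The sketch below assumes the usual binomial form of the n-of-m reliability, that is, the probability that at least n of m independent, identical nodes are working; the function name `reliability_n_of_m` is hypothetical.

```python
from math import comb


def reliability_n_of_m(n: int, m: int, r: float) -> float:
    """Probability that at least n of m identical nodes, each with
    reliability r, are non-faulty (nodes assumed to fail independently)."""
    if not 0 <= n <= m:
        raise ValueError("need 0 <= n <= m")
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sum(comb(m, i) * r**i * (1 - r) ** (m - i) for i in range(n, m + 1))


if __name__ == "__main__":
    # Two good nodes out of three, each with reliability 0.9.
    print(reliability_n_of_m(2, 3, 0.9))
```

For three nodes of reliability 0.9 and n = 2, the sum is 3(0.9)^2(0.1) + (0.9)^3 = 0.972, up to floating-point rounding.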

2.2.1 Non-malicious faults ($\mathcal{F}_B$)

All faults are in $\mathcal{F}_B$, where $\mathcal{F}_B = \mathcal{B} = \mathcal{B}_S \cup \mathcal{B}_A$, and can be detected by active redundancy techniques.
Since all faults that are assumed to occur can be recognized as such, $m_B = f_B + 1$ nodes are needed
to guarantee correct operation in the presence of $f_B$ faults. The reliability of this model is given
by Equation 1 as $R(1 \text{ of } m_B)$. This model is overly optimistic, as it assumes that malicious faults
cannot occur.

2.2.2 Symmetric faults ($\mathcal{F}_S$)

All faults are assumed to be symmetric, and potentially malicious, where $\mathcal{F}_S = \mathcal{S} = \mathcal{B}_S \cup \mathcal{M}_S$.
By definition, passive redundancy techniques are required to mask such faults. However, since all
the faults are symmetric, non-iterative passive redundancy techniques, such as majority or fault-tolerant
voting, are sufficient to mask any faults. Since a majority of good values is required to
compute an accurate value in the presence of faults, a minimum of $m_S$ nodes is required to tolerate
$f_S$ malicious symmetric faults, where $m_S = 2 f_S + 1$. A minimum of $n_S$ non-faulty nodes is required
to keep the system operational, with $n_S = m_S - \lfloor (m_S - 1)/2 \rfloor$, where $\lfloor x \rfloor$ is the largest integer not greater
than $x$. The reliability of the system is given by Equation 1 as $R(n_S \text{ of } m_S)$, with a minimum
of three nodes required to tolerate a single symmetric malicious fault. This model is also overly
optimistic, as it does not handle asymmetric non-malicious faults, and assumes that Byzantine or
asymmetric malicious faults cannot occur.

² If only one of many nodes is good, it may be impossible to identify the correct node. Such identification is
beyond the scope of this report. However, we do guarantee that the good node always holds a correct value or a
value consistent with all other good nodes’ values.

2.2.3 Asymmetric faults ($\mathcal{F}_A$)

When all faults are asymmetric, and potentially malicious, with $\mathcal{F}_A = \mathcal{A} = \mathcal{B}_A \cup \mathcal{M}_A$, the non-iterative
passive redundancy techniques described earlier are no longer adequate. Instead, iterative
algorithms such as interactive convergence [3, 4] and interactive consistency [4, 5], with multiple
rounds of message exchange, are required to mask these faults. A minimum of $m_A$ processors is
required to tolerate $f_A$ faults in $\mathcal{F}_A$, with $m_A = 3 f_A + 1$. So, under this model, a minimum of four
nodes is needed to tolerate a single, potentially malicious, asymmetric fault. From Equation 1, the
reliability is given by $R(n_A \text{ of } m_A)$ with $n_A = m_A - \lfloor (m_A - 1)/3 \rfloor$. This model is overly pessimistic,
because not all faults are malicious asymmetric. Furthermore, the extra rounds of message exchange
increase the complexity and cost of the algorithms needed to tolerate the faults assumed by the
model.
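
The node requirements of the three single-type models can be compared side by side. This sketch simply encodes and inverts the bounds as we read them above (f + 1, 2f + 1, and 3f + 1 nodes to tolerate f faults); the function and dictionary names are illustrative.

```python
# Nodes consumed per tolerated fault in each single-type model:
# non-malicious needs f + 1, symmetric 2f + 1, asymmetric 3f + 1.
NODES_PER_FAULT = {"non-malicious": 1, "symmetric": 2, "asymmetric": 3}


def nodes_required(model: str, faults: int) -> int:
    """Minimum number of nodes needed to tolerate `faults` worst-case faults."""
    return NODES_PER_FAULT[model] * faults + 1


def max_faults(model: str, nodes: int) -> int:
    """Largest number of worst-case faults that `nodes` nodes can tolerate."""
    return (nodes - 1) // NODES_PER_FAULT[model]


if __name__ == "__main__":
    for model in NODES_PER_FAULT:
        print(model, nodes_required(model, 1), max_faults(model, 5))
```

With five nodes, for instance, the asymmetric model tolerates only one fault, while the non-malicious model tolerates four; the gap is what motivates treating mixed fault types explicitly.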


2.2.4 Mixed faults: the unified model ($\mathcal{F}_U$)

Unlike the previous models, the unified model [11] is not limited to a single fault type. Instead,
the three fault types used in previous models are supported in a single model. Using the unified
model, we have $\mathcal{F}_U = \mathcal{F}_B \cup \mathcal{F}_S \cup \mathcal{F}_A$, giving $\mathcal{F}_U = \mathcal{F}$. The unified model maximizes the number of
non-malicious and symmetric malicious faults which can be tolerated, under the constraint of an
upper bound on the number of malicious asymmetric faults.

The Z(r) algorithm, similar to the Oral Messages (OM(q)) algorithm of [12] used for up to q
arbitrarily malicious faults, achieves interactive consistency in the presence of $f_U = 2 f_A + 2 f_S + f_B$
faults, when there are at least $m_U$ processors, with $m_U = f_U + r + 1$, and $r \ge f_A$ rebroadcast rounds.
Because the assumption of multiple fault types requires the use of more than just the simple R(n
of m) from Equation 1, the reliability formula is more complex, and appears in [13].

This model shows improvement over the previous model for numbers of nodes, $m_U$, which cannot
be written as $3k + 1$ for any integer k. For example, consider both the OM(1) and Z(1) algorithms
with five nodes ($m_U = 3k + 2$, where $k = 1$). Since the OM(1) algorithm treats all other faults as if
they were in $\mathcal{F}_A$, only one fault of any type is tolerated. In contrast, with five nodes, the algorithm
Z(1) tolerates a single fault in $\mathcal{F}_A$ and a single fault in $\mathcal{F}_B$; or three faults in $\mathcal{F}_B$; or one fault
in $\mathcal{F}_S$ and one in $\mathcal{F}_B$. Unfortunately, this model applies only to interactive consistency algorithms
using the Z(r) algorithm with r rounds of rebroadcast.
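
The five-node example can be checked mechanically. The sketch assumes that the Z(r) condition reads: the number of nodes is at least 2·(asymmetric faults) + 2·(symmetric faults) + (non-malicious faults) + r + 1, with r no smaller than the number of asymmetric faults. That is our reading of the bound above, and `tolerates` is a hypothetical helper, not part of [11].

```python
def tolerates(nodes: int, rounds: int, f_asym: int, f_sym: int, f_benign: int) -> bool:
    """Check whether Z(rounds) on `nodes` nodes covers the given fault mix,
    under the assumed unified-model bound."""
    if rounds < f_asym:
        return False  # not enough rebroadcast rounds for the asymmetric faults
    weighted = 2 * f_asym + 2 * f_sym + f_benign
    return nodes >= weighted + rounds + 1


if __name__ == "__main__":
    # Fault mixes given as (asymmetric, symmetric, non-malicious) on five nodes, r = 1.
    for mix in [(1, 0, 1), (0, 0, 3), (0, 1, 1), (1, 1, 0)]:
        print(mix, tolerates(5, 1, *mix))
```

The first three mixes are the ones listed in the example and satisfy the bound with equality; the fourth (one asymmetric plus one symmetric fault) would need six nodes.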
2.3 The Hybrid Fault Model

We developed the hybrid fault model, which recognizes mixed fault types, by combining the fault
models described in §2.2. The fault tolerance algorithms used in conjunction with the models
in §2.2 are replaced by hybrid algorithms that assume mixed faults, as defined in §2.4 and §2.5.

The mixed fault sets used in the hybrid model are combinations of the sets $\mathcal{B}_S$, $\mathcal{B}_A$, $\mathcal{M}_S$,
and $\mathcal{M}_A$ (listed in increasing order of severity) that were defined in §2.1.2. The type of hybrid
redundancy algorithm, active or passive, required to tolerate all faults in the assumed fault set is
indicated. The type of passive hybrid redundancy algorithm required, iterative or non-iterative,
is a function of the worst-case fault type, the data type, and the application. When the fault set
requires iterative redundancy, the choice of a hybrid interactive consistency or a hybrid interactive
convergence algorithm also depends on the application and data type. If no hybrid algorithms
are used, then the hybrid model reverts to a combination of the non-malicious, symmetric, and
asymmetric models described in §2.2, with no improvement. Thus, associating the proper hybrid
algorithm with the assumed system or node fault set is the key to the hybrid model, as defined
below.

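Definition 2 (Hybrid Fault Model)

The hybrid fault model consists of three fault scenarios $H_B$, $H_{MS}$, and $H_{MA}$, determined by the
faults assumed to be tolerated. The corresponding fault sets are given by $\mathcal{F}_B$, $\mathcal{F}_{MS}$, and $\mathcal{F}_{MA}$,
with

$$\begin{aligned}
\mathcal{F}_B &= \mathcal{B}_S \cup \mathcal{B}_A, \\
\mathcal{F}_{MS} &= \mathcal{B}_S \cup \mathcal{B}_A \cup \mathcal{M}_S, \text{ and} \\
\mathcal{F}_{MA} &= \mathcal{B}_S \cup \mathcal{B}_A \cup \mathcal{M}_S \cup \mathcal{M}_A = \mathcal{F}.
\end{aligned}$$

Thus, the form of the hybrid fault model that applies to a given node or node set is determined by
the worst-case faults that are assumed or shown to occur.³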

$H_B$: The worst-case faults can be shown or assumed to be non-malicious, and the fault set is $\mathcal{F}_B$.
Hybrid active redundancy algorithms can be used (see §2.4), with $m_B$ processes needed to
tolerate $f_B$ faults, where $m_B \ge f_B + 1$.

$H_{MS}$: The worst-case faults can be shown or assumed to be malicious symmetric, and the fault
set is $\mathcal{F}_{MS}$. Hybrid non-iterative passive redundancy algorithms, which handle mixed faults,
must be used (see §2.5). A total of $m_{MS}$ processes is needed to tolerate $f_{MS} = f_B + 2 f_{\mathcal{M}_S}$
faults, with $m_{MS} \ge f_{MS} + 1$. Here $f_B$ and $f_{\mathcal{M}_S}$ count the non-malicious and malicious
symmetric faults, respectively, so $f_{MS}$ is a weighted total.

$H_{MA}$: The worst-case faults can be shown or assumed to be malicious asymmetric. The fault
set is $\mathcal{F}_{MA} = \mathcal{F}$, which means that all possible faults are addressed. An iterative passive
redundancy algorithm is required, with the algorithm Z(r) of the unified model [11] sufficient
for interactive consistency. If the Z(r) algorithm is used, then $m_{MA}$ nodes can tolerate a
total of $f_{MA} = 2 f_{\mathcal{M}_A} + 2 f_{\mathcal{M}_S} + f_B$ faults, with $m_{MA} \ge f_{MA} + r + 1$.

For other hybrid iterative passive redundancy algorithms, a minimum of $m_{MA}$ nodes is sufficient
to tolerate $f_{MA} = 3 f_{\mathcal{M}_A} + 2 f_{\mathcal{M}_S} + f_B$ faults, with $m_{MA} \ge f_{MA} + 1$. Such hybrid
interactive consistency and convergence algorithms will need to be based on the hybrid fault-tolerant
voting functions presented in §2.5.
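
Choosing a scenario and sizing the system can be sketched as follows. The weights (1 per non-malicious fault, 2 per malicious symmetric fault, and 2 or 3 per malicious asymmetric fault) follow our reading of the bounds above, and the requirement that r be at least the number of malicious asymmetric faults is carried over from §2.2.4; the function name and return format are hypothetical.

```python
from typing import Optional, Tuple


def hybrid_requirement(
    f_b: int, f_ms: int, f_ma: int, z_rounds: Optional[int] = None
) -> Tuple[str, int]:
    """Return (scenario, minimum node count) for the assumed fault counts.

    f_b, f_ms, f_ma: numbers of non-malicious, malicious symmetric, and
    malicious asymmetric faults to be tolerated.
    z_rounds: rebroadcast rounds r if Z(r) is used in scenario H_MA;
    None selects the bound for other iterative algorithms.
    """
    if f_ma > 0:
        if z_rounds is not None:
            if z_rounds < f_ma:
                raise ValueError("Z(r) needs r >= malicious asymmetric faults")
            return "H_MA", 2 * f_ma + 2 * f_ms + f_b + z_rounds + 1
        return "H_MA", 3 * f_ma + 2 * f_ms + f_b + 1
    if f_ms > 0:
        return "H_MS", f_b + 2 * f_ms + 1
    return "H_B", f_b + 1


if __name__ == "__main__":
    print(hybrid_requirement(2, 0, 0))
    print(hybrid_requirement(1, 1, 0))
    print(hybrid_requirement(1, 1, 1))
    print(hybrid_requirement(1, 0, 1, z_rounds=1))
```

The scenario is selected by the worst-case fault present, so adding a single malicious asymmetric fault moves the whole node set into H_MA even when most faults are non-malicious.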

³ It may also be sufficient to demonstrate extremely low probabilities of occurrence for faults excluded from the
assumed fault set.

In future work, we will provide a reliability model for the hybrid fault model, compare it with
the static reliability models presented in §2.2, and address the issue of imperfect fault coverage.

2.4  Hybrid  Active  Redundancy  Techniques 

In  the  active  redundancy  techniques  developed  to  support  the  hybrid  fault  model,  we  examine  each 
message  or  value  received  by  a  node  using  an  active  redundancy  technique.  If  a  non-malicious  fault 
is  detected,  such  as  a  framing,  parity,  or  encoding  fault,  a  missing  message,  or  a  range  violation, 
then we adopt a default error or status value, $v_e$, as the value received by the node in the message.
Under no circumstances can $v_e$ be an acceptable value, and it may differ based on the data types of
correct values. Without loss of generality, we assume that the value $v_e$ is greater than any permitted
numerical data value. We also assume perfect detection, as faulty nodes are not of concern. The
only difference between hybrid and standard active redundancy techniques is in the specific action
taken to adopt a default error value, $v_e$, when a non-malicious fault is detected.
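
A receiver-side check in this style might look like the sketch below. It assumes numeric payloads, models a missing message as `None`, and uses positive infinity for the error value so that it exceeds every permitted value; the names `V_E` and `receive`, and the particular checks, are illustrative rather than taken from this report.

```python
import math

# Assumed representation of the default error value: larger than any
# permitted numerical data value.
V_E = math.inf


def receive(message, low, high, parity_ok=True):
    """Return the received value, or V_E if a non-malicious fault is detected.

    message: numeric payload, or None if no message arrived.
    low, high: inclusive range of permitted values (range check).
    parity_ok: outcome of a framing, parity, or encoding check done elsewhere.
    """
    if message is None:  # missing message
        return V_E
    if not parity_ok:  # framing, parity, or encoding fault
        return V_E
    if not (low <= message <= high):  # range violation
        return V_E
    return message


if __name__ == "__main__":
    print(receive(5, 0, 10), receive(None, 0, 10), receive(11, 0, 10))
```

Every detected fault collapses to the same marker, so later voting stages only need to recognize one value in order to discard faulty inputs.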

2.5  Hybrid  Fault-Tolerant  Voting  Functions 

Passive  redundancy  techniques  usually  employ  voting  functions  designed  to  tolerate  a  predefined 
number  of  faults  based  on  the  total  number  of  values  they  receive.  We  extended  these  functions  to 
accommodate the default error or status value, $v_e$, which is the consequence of a non-malicious fault
under the hybrid active redundancy techniques defined above.

We define the function exclude(V) = $V_e$, which takes a set of m elements, $V = \{v_1, v_2, \ldots, v_m\}$,
removes any error values, $v_e$, from V, and returns the set $V_e$, containing $m_e$ elements. The set $\mathcal{E}$
is the original fault set $\mathcal{F}$ with the $f_B$ non-malicious faults in $\mathcal{F}_B$ removed, and $m_e = m - f_B$. In
the absence of non-malicious faults ($f_B = 0$), $V = V_e$, as no elements are excluded. Hybrid voting
functions are based on the exclude() function.

2.5.1  Hybrid  Majority  Vote 

A  majority  vote  is  typically  used  by  each  good  node  to  compute  a  common  final  value  for  bimodal 
values received from other nodes or input sources. Let V be a set of m elements, as described
above, with $v_0$ a default value, which could potentially be taken by any $v_i$. Then, we have

$$\mathrm{majority}(\mathrm{exclude}(V)) = \begin{cases} v & \text{if a majority of the elements of } V_e \text{ equal } v, \\ v_0 & \text{otherwise.} \end{cases}$$

The default value, $v_0$, returned when no majority exists must be defined a priori and must be a
potentially correct value, to avoid introducing a fault into a fault-free scenario. Since the majority
function ignores $\lfloor (m_e - 1)/2 \rfloor$ elements, the composite function tolerates up to f faults, where
$f = f_B + \lfloor (m_e - 1)/2 \rfloor$.
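
A minimal sketch of the composite function follows. It assumes that the error value is represented by positive infinity and that a strict majority (more than half of the retained values) is required; `exclude` and `hybrid_majority` are illustrative names.

```python
import math
from collections import Counter

V_E = math.inf  # assumed representation of the default error value


def exclude(values):
    """Drop every occurrence of the error value."""
    return [v for v in values if v != V_E]


def hybrid_majority(values, default):
    """Majority vote over the values that survive exclusion.

    Returns `default` when no value holds a strict majority, including
    the case where every input was excluded.
    """
    kept = exclude(values)
    if kept:
        value, count = Counter(kept).most_common(1)[0]
        if 2 * count > len(kept):
            return value
    return default


if __name__ == "__main__":
    # Two detected faults and one undetected wrong value among five inputs.
    print(hybrid_majority([1, 1, 0, V_E, V_E], default=0))
```

In the usage line, two inputs are flagged and one of the remaining three is wrong, yet the vote still returns 1: three faulty inputs out of five are tolerated because the detected ones no longer count against the majority.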

2.5.2  Hybrid  Mean  and  Midpoint 

The  functions  mean  and  midpoint  are  commonly  used  to  average  numerical  data.  However,  in 
their  raw  state  they  are  sensitive  to  extreme  values.  So,  we  defined  fault-tolerant  versions  of  these 
functions using the reduce function, where, if V is a set of m values to be voted, then:

$$\mathrm{reduce}(V, t) = V - \{\text{the } t \text{ largest and } t \text{ smallest } v_i\}.$$

To tolerate the most faults, t is taken to be $t(m)$, the maximum number of faults tolerable by m elements.

The mean of m values $v_i \in V$, for $i = 1, \ldots, m$, is $\mathrm{mean}(V) = \frac{1}{m} \sum_{i=1}^{m} v_i$.

The midpoint of m values $v_i \in V$, for $i = 1, \ldots, m$, is the mean of extrema, with
$\mathrm{midpoint}(V) = \frac{1}{2} \left( \min_{i=1,\ldots,m}(v_i) + \max_{i=1,\ldots,m}(v_i) \right)$, often called the mean of medial extremes (MME).

The hybrid fault-tolerant mean and hybrid fault-tolerant MME functions are achieved by applying
the mean and midpoint functions to restricted subsets of values, where the restriction first
removes the values $v_e$ (from detected non-malicious faults), then eliminates the extrema from the
remaining elements using the reduce function. The number of extrema eliminated now depends on
$m_e$, the number of elements remaining after removing the $f_B$ non-malicious fault values $v_e$. So, we
have

$$\begin{aligned}
\text{hybrid-fault-tolerant mean}(V) &= \mathrm{mean}(\mathrm{reduce}(\mathrm{exclude}(V), t(m_e))) \\
\text{hybrid-fault-tolerant MME}(V) &= \mathrm{midpoint}(\mathrm{reduce}(\mathrm{exclude}(V), t(m_e))),
\end{aligned}$$

where $t(m_e)$ is the maximum number of faults tolerable by the $m_e$ remaining elements. Each function
tolerates a total of $f = f_B + t(m_e)$ faulty elements.
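
The two composite functions can be sketched as below. The expression for t(m) is not reproduced here, so the default `max_t` is an assumption of ours: it is simply the largest t for which reduce still leaves at least one value, and callers can pass a different rule. All names are illustrative, and the error value is modelled as positive infinity.

```python
import math

V_E = math.inf  # assumed representation of the default error value


def exclude(values):
    """Drop every occurrence of the error value."""
    return [v for v in values if v != V_E]


def reduce(values, t):
    """Discard the t largest and t smallest values."""
    ordered = sorted(values)
    return ordered[t:len(ordered) - t]


def max_t(m):
    """Assumed default: largest t such that m - 2t >= 1."""
    return (m - 1) // 2


def _restricted(values, t_of):
    kept = exclude(values)
    rest = reduce(kept, t_of(len(kept))) if kept else []
    if not rest:
        raise ValueError("no values remain after exclusion and reduction")
    return rest


def hybrid_ft_mean(values, t_of=max_t):
    rest = _restricted(values, t_of)
    return sum(rest) / len(rest)


def hybrid_ft_mme(values, t_of=max_t):
    rest = _restricted(values, t_of)
    return (min(rest) + max(rest)) / 2


if __name__ == "__main__":
    readings = [0.0, 1.0, 2.0, 6.0, 50.0, V_E]
    print(hybrid_ft_mean(readings, lambda m: 1), hybrid_ft_mme(readings, lambda m: 1))
```

Passing `t_of` as a parameter keeps the choice of t(m) explicit; with the assumed default, at most two values survive the reduction, so both functions then return the same result as a median.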


2.5.3  Hybrid  Median 

The  median  or  median-select  voting  function  returns  the  middle  value  of  an  odd  number  of  ordered 
elements, or the average of the two middle values of an even number of ordered elements. Since extreme
values are ignored, the median is inherently fault-tolerant. So, the hybrid median consists of
the median() function applied after the exclude() function. For a set V of m ordered values $\{v_1, v_2, \ldots, v_m\}$,
where $v_q \le v_{q+1}$,

$$\text{hybrid-median}(V) = \mathrm{median}(\mathrm{exclude}(V)) = \frac{v_i + v_j}{2},$$

where $i = 1 + k$, $j = m_e - k$, and $k = \lfloor (m_e - 1)/2 \rfloor$. Since $v_q < v_e$ by definition, the excluded values
$v_e$ will be the $f_B$ largest values. So, the elements remaining in $V_e$ after application of the exclude
function will be $\{v_1, v_2, \ldots, v_{m_e}\}$.
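
The index arithmetic can be sketched as follows, assuming k is the integer part of (number of retained values minus 1) divided by 2, and that the error value is modelled as positive infinity; `hybrid_median` is an illustrative name.

```python
import math

V_E = math.inf  # assumed representation of the default error value


def hybrid_median(values):
    """Median of the values that remain after removing error values.

    Uses 1-based indices i = 1 + k and j = m_e - k with k = (m_e - 1) // 2,
    so i == j for an odd count and (i, j) are the two middle positions
    for an even count.
    """
    kept = sorted(v for v in values if v != V_E)
    m_e = len(kept)
    if m_e == 0:
        raise ValueError("no values remain after exclusion")
    k = (m_e - 1) // 2
    i, j = 1 + k, m_e - k
    return (kept[i - 1] + kept[j - 1]) / 2


if __name__ == "__main__":
    print(hybrid_median([3, 1, 2, V_E, V_E]))  # odd number of retained values
    print(hybrid_median([4, 1, 3, 2, V_E]))    # even number of retained values
```

With three retained values, k = 1 and i = j = 2, giving the middle element; with four, k = 1, i = 2, and j = 3, giving the average of the two middle elements.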

In future work, we will develop the interactive convergence and interactive consistency techniques
(using algorithms besides Z(r)) which take advantage of the mixed fault types of the hybrid
fault model. The basis for these algorithms will be the hybrid fault-tolerant voting functions presented
in this section.


2.6  Fault  Classification  and  Containment 

Since the fault sets $\mathcal{B}_S$, $\mathcal{B}_A$, $\mathcal{M}_S$, and $\mathcal{M}_A$ are mutually exclusive and collectively exhaustive,
we  used  the  hybrid  fault  model  as  the  basis  for  a  simple  fault  classification  and  system  analysis 
algorithm,  the  Hybrid  Fault  Analysis  Algorithm.  We  also  derived  directions  for  future  research  by 
applying  this  algorithm  to  a  few  simple  examples. 

2.6.1  Hybrid  Fault  Analysis  Algorithm  (HFA) 

Assume there is a single transmitting node, T, sending messages to k receiving nodes, $R_i$, for
$i = 1, \ldots, k$, in a set of m nodes. Without loss of generality, assume that a single output message
is generated by T, with the true value given by v, and that node $R_i$ receives value $e_i$.⁴ If no message
is received by a receiver $R_j$, then $e_j = \emptyset$.

The  basic  philosophy  in  discerning  potential  fault  malice  is  to  determine  the  worst-case  errors 
produced  in  messages  by  the  faulty  node  and  then  to  decide  what  type  of  active  redundancy 
techniques,  if  any,  can  be  used  to  detect  the  faulty  node.  If  such  a  technique  is  implemented  in 
the receiving node, then the fault is non-malicious. Otherwise, it is malicious. With the malice
of the fault determined, an examination of the worst-case error values indicates fault symmetry.
We consider the output sets corresponding to the worst-case errors, represented by $\{e_1, \ldots, e_k\}$. If
$v \ne e_i$ and $e_i = e_j$ for all i and j, for all such errors, then the worst-case node fault is symmetric;
otherwise,  it  is  asymmetric.  Once  the  type  of  the  worst-case  fault  has  been  determined,  the  hybrid 
fault  model  indicates  the  type  of  algorithm  required  to  tolerate  the  fault  in  that  output,  which  can 
be  compared  to  the  algorithms  actually  implemented. 
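
The classification step can be sketched as follows. The interface is an assumption made for illustration: the caller supplies the true value v, the worst-case output sets (one list of received values per worst-case error, one entry per receiver), and a flag saying whether an active redundancy technique implemented in the receivers detects the fault. The function name is hypothetical.

```python
def hfa_classify(v, worst_case_outputs, detectable):
    """Classify the worst-case fault of a transmitting node.

    v: the true value of the output message.
    worst_case_outputs: iterable of lists; each list holds the values
        received by the k receivers for one worst-case error.
    detectable: True if an implemented active redundancy technique
        detects the fault at the receivers.
    Returns a (malice, symmetry) pair of strings.
    """
    malice = "non-malicious" if detectable else "malicious"
    # Symmetric: in every worst-case output set, all receivers see the
    # same value and that value differs from the true value.
    symmetric = all(
        all(e != v and e == outputs[0] for e in outputs)
        for outputs in worst_case_outputs
    )
    return malice, "symmetric" if symmetric else "asymmetric"


if __name__ == "__main__":
    print(hfa_classify(5, [[7, 7, 7]], detectable=False))
    print(hfa_classify(5, [[7, 8, 7]], detectable=False))
```

The resulting pair identifies one of the four hybrid fault classes, which in turn indicates whether active, non-iterative passive, or iterative passive redundancy is needed.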

2.6.2 HFA Evaluation Results

The fault scenarios below indicate the uses of the HFA, as well as future work needed to make
the  algorithm  more  robust. 

Suppose the worst-case fault were determined to be malicious symmetric. Then, by $H_{MS}$, a
passive  redundancy  technique  must  be  implemented  in  the  receivers  to  tolerate  that  fault.  If  the 
system  uses  only  active  redundancy  techniques,  then  the  probability  of  the  worst-case  fault  should 
be  demonstrated  to  be  extremely  low,  or  a  malicious  symmetric  fault  could  represent  a  single  point 
of  failure. 

The worst-case faults are not the only consideration of the HFA, as shown by the following
example. Suppose that a fault f from a sender is classified as malicious because the range check
that could detect f is not implemented in the receiver. If the receiving node is redesigned to
implement a range check, or any other technique which will detect f, then the malicious fault f
is transformed into a non-malicious fault. Thus, the potential for fault transformation is indicated
using the HFA.

It is also possible for several faults to be active in T, producing errors which are seen as a single
fault by each non-faulty node $R_i$, or which are never seen by any non-faulty node $R_i$, as described
in the following example. Suppose that T is faulty, i.e., there exists at least one fault f in T.
Consider all possible output sets $\{e_1, \ldots, e_k\}$ which the node T could produce in the presence of f.
If, for all such output sets, $v = e_i$ for all i, then no error will be produced by the faulty transmitting
node. This does not mean that T is not faulty, only that the fault f has been contained within
node T, and the non-faulty $R_i$ cannot discern the fault in T because no errors are manifested in the
messages they receive. While precise fault classification improves reliability estimates by indicating
probable and improbable errors, this example shows that the manner in which faults are counted
is just as important as their classification. We will address these issues in future work on fault
containment.

The  HFA  does  not  address  the  effects  of  the  faulty  node  T  on  different  sets  of  receiving  nodes 
simultaneously,  nor  the  effects  of  several  faulty  nodes.  Furthermore,  at  the  receiver  level,  an  error 
in  a  message  from  T  is  viewed  as  a  single  node  fault,  since  the  one  node  transmitting  is  assumed 
to  be  faulty,  and  not  the  communication  link. 

⁴ We use a single output in this algorithm to represent the fact that all nodes $R_i$ receive the same set of information
from T in a fault-free scenario. In practice, the single output is a message or a set of messages.

By combining the HFA with the error manifestation taxonomy, presented in [1], we can
develop a more robust algorithm that deals with these issues. Our future work also will address fault
transformation and will evaluate and construct fault containment regions, i.e., system partitions,
components or nodes with independent failure probabilities, that limit the scope of faults.